[core] Optimization of Parquet Predicate Pushdown Capability #4608

Open
wants to merge 3 commits into
base: master

Conversation

Aiden-Dong

Purpose

Linked issue: #4586

This PR improves predicate pushdown for filtered reads of Parquet files. Pushdown previously stopped at the RowGroup level; it now extends to the column page level, which significantly improves query performance.
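
To illustrate the difference between the two levels, here is a minimal sketch in plain Java. It is not the code in this PR and it does not use the Parquet library; `PagePruningSketch`, `PageStats`, `RowRange`, and both methods are hypothetical names, and the sketch assumes each page carries min/max statistics for the filtered column and that the predicate is a simple equality on a long value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not Paimon or Parquet library code.
final class PagePruningSketch {

    /** Min/max statistics and row count for one page of a column chunk. */
    static final class PageStats {
        final long min;
        final long max;
        final int rowCount;

        PageStats(long min, long max, int rowCount) {
            this.min = min;
            this.max = max;
            this.rowCount = rowCount;
        }
    }

    /** Half-open range [start, end) of row indexes inside a row group. */
    static final class RowRange {
        final long start;
        final long end;

        RowRange(long start, long end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + end + ")";
        }
    }

    /** Row-group level: the whole group is either read or skipped. */
    static boolean rowGroupMayMatch(List<PageStats> pages, long value) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (PageStats p : pages) {
            min = Math.min(min, p.min);
            max = Math.max(max, p.max);
        }
        return value >= min && value <= max;
    }

    /** Page level: only the row ranges of pages that may contain the value are read. */
    static List<RowRange> pagesThatMayMatch(List<PageStats> pages, long value) {
        List<RowRange> ranges = new ArrayList<>();
        long firstRow = 0;
        for (PageStats p : pages) {
            if (value >= p.min && value <= p.max) {
                ranges.add(new RowRange(firstRow, firstRow + p.rowCount));
            }
            firstRow += p.rowCount; // skipped pages still advance the row index
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<PageStats> pages = Arrays.asList(
                new PageStats(0, 99, 100),
                new PageStats(100, 199, 100),
                new PageStats(200, 299, 100));
        System.out.println(rowGroupMayMatch(pages, 150));  // true: all 300 rows would be read
        System.out.println(pagesThatMayMatch(pages, 150)); // [[100, 200)]: only 100 rows are read
    }
}
```

In this toy case the row-group check cannot exclude anything, while the page-level check narrows the read to one page out of three.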

Tests

API and Format

Documentation

@JingsongLi
Contributor

Looks very nice! Thanks @Aiden-Dong, I will review it next week.

@Aiden-Dong
Author

Using the test case mentioned in [#4586], I generated 4 million test records.
With 30 random reads, the original implementation took 12-17 seconds, while the optimized version takes approximately 1 second.

before

image

after

image

@JingsongLi
Contributor

@Aiden-Dong Can you add a test for Parquet page predicate pushdown with deletion vectors enabled? I just want to make sure currentRowPosition in ParquetReaderFactory still works correctly.
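
The concern can be illustrated with a small sketch. It assumes that a deletion vector marks deleted rows by their position in the data file, so a reader that skips pages must keep tracking the absolute file position, not the number of rows it has returned. Everything here (`RowPositionSketch`, both methods, the range layout) is hypothetical and is not the actual ParquetReaderFactory logic.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical illustration only; not the ParquetReaderFactory implementation.
final class RowPositionSketch {

    /** Correct: checks the deletion vector with the absolute file row position. */
    static List<Long> survivingPositions(long[][] selectedRanges, BitSet deleted) {
        List<Long> out = new ArrayList<>();
        for (long[] range : selectedRanges) {
            // The position jumps to the start of each selected range,
            // so rows in skipped pages still consume positions.
            for (long pos = range[0]; pos < range[1]; pos++) {
                if (!deleted.get((int) pos)) {
                    out.add(pos);
                }
            }
        }
        return out;
    }

    /** Incorrect on purpose: counts returned rows and ignores skipped pages. */
    static List<Long> naiveSurvivors(long[][] selectedRanges, BitSet deleted) {
        List<Long> out = new ArrayList<>();
        int returned = 0;
        for (long[] range : selectedRanges) {
            for (long pos = range[0]; pos < range[1]; pos++) {
                if (!deleted.get(returned)) {
                    out.add(pos);
                }
                returned++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Page filtering kept file rows [2, 4) and [6, 8); rows 3 and 6 are deleted.
        long[][] ranges = {{2, 4}, {6, 8}};
        BitSet deleted = new BitSet();
        deleted.set(3);
        deleted.set(6);
        System.out.println(survivingPositions(ranges, deleted)); // [2, 7]
        System.out.println(naiveSurvivors(ranges, deleted));     // [2, 3, 6], deleted rows leak through
    }
}
```

A test combining page-level pushdown with deletion vectors would catch the second behavior, since it only shows up when some pages are actually skipped.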
